1. Executive summary & goals
This report examines publication and audience engagement trends for Towards Data Science (TDS) articles from the first recorded month in 2010 through mid-2021. It is written for visualization researchers, data-science students, and Medium/TDS contributors who want a clear picture of how publishing volume, author contributions, and reader engagement (claps, responses, and reading time) have evolved. Our goals are to describe the dataset and cleaning steps, characterize temporal publishing patterns and the rise of paid content, quantify engagement and reading-efficiency patterns, identify highly engaged articles and influential authors, and surface practical recommendations for authors and automated analytics dashboards. Key questions include: how have publication volume and paid-content adoption changed over time, how concentrated is audience attention, how do paid and free articles compare on engagement, and which reading-time windows are associated with stronger audience response?
2. Data, cleaning, and derived metrics
The core dataset contains 46,079 article records (no duplicates remain after deduplication) with fields including publish_date, author, title, claps, responses, reading_time (minutes), and a normalized paid flag. Cleaning steps included parsing datetimes, converting blank text fields to empty strings, treating NaN claps/responses as 0, normalizing paid to a boolean, and excluding the one row with non-positive reading_time when computing reading_efficiency. Derived features used throughout the report are: engagement_score (claps + responses, used as the primary combined engagement metric), average claps per response (claps divided by max(1, responses)), reading_efficiency (claps per minute = claps / reading_time), and monthly article counts with paid_ratio (the proportion of articles marked paid in each month). Summary diagnostics show heavy skew and long tails: claps mean 266.3 (median 87), responses mean 1.71 (median 0), reading_time mean 7.17 min (median 6), engagement_score mean 283.4 (median 94), and reading_efficiency mean 40.94 claps/min (median 13.5). Outliers are present above the 99th percentiles (e.g., claps > 3100, engagement > 3330) and include a small set of viral articles (top engagement ~54,980) that drive the most extreme values; zero-clap articles are rare (~0.7%). These checks confirm strong right skew, a small number of extreme high-engagement articles, and overall data consistency after cleaning.
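The cleaning steps and derived metrics described above could be sketched in pandas roughly as follows. This is a minimal illustration, not the report's actual pipeline: the function names clean_and_derive and monthly_summary, the derived column name claps_per_response, and the set of raw values treated as "paid" are assumptions made for the example. The field names and formulas are taken from the description above.

```python
import pandas as pd


def clean_and_derive(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps and add the per-article derived metrics."""
    out = df.copy()

    # Parse datetimes; unparseable values become NaT.
    out["publish_date"] = pd.to_datetime(out["publish_date"], errors="coerce")

    # Blank text fields become empty strings.
    for col in ("author", "title"):
        out[col] = out[col].fillna("").astype(str).str.strip()

    # Missing claps/responses are treated as 0.
    for col in ("claps", "responses"):
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0)

    # Normalize the paid flag to a boolean. The accepted raw spellings
    # are an assumption; the source does not list them.
    truthy = {"true", "1", "yes", "paid"}
    out["paid"] = out["paid"].map(lambda v: str(v).strip().lower() in truthy)

    # Deduplicate on the normalized rows.
    out = out.drop_duplicates()

    # engagement_score = claps + responses
    out["engagement_score"] = out["claps"] + out["responses"]

    # claps divided by max(1, responses)
    out["claps_per_response"] = out["claps"] / out["responses"].clip(lower=1)

    # claps per minute; rows with non-positive reading_time are left as NaN
    # so they drop out of reading-efficiency statistics only.
    minutes = pd.to_numeric(out["reading_time"], errors="coerce")
    out["reading_efficiency"] = (out["claps"] / minutes).where(minutes > 0)

    return out


def monthly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Monthly article counts and the share of articles marked paid."""
    month = df["publish_date"].dt.to_period("M")
    return df.groupby(month).agg(
        articles=("title", "size"),
        paid_ratio=("paid", "mean"),
    )
```

Masking reading_efficiency with NaN, rather than deleting the row, keeps the article available for the other metrics, which matches the statement that the row was removed only when computing reading efficiency. Skew diagnostics such as the means, medians, and 99th-percentile thresholds quoted above could then be read from cleaned[["claps", "engagement_score"]].describe() and .quantile(0.99).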